The 3D RNA-seq pipeline includes complex steps of expression data generation, data pre-processing, 3D analysis and result summary. This manual provides step-by-step instructions to use the 3D RNA-seq App. To use this App in your work, please cite:
Command-line based 3D analysis manual:ThreeDRNAseq R package can be installed from Github by using devtools R package
##################################################################################################
## use devtools R package to install ThreeDRNAseq from Github
###---> If devtools is not installed, please install
if(!requireNamespace("devtools", quietly = TRUE))
install.packages('devtools')
###---> Install ThreeDRNAseq
devtools::install_github('wyguo/ThreeDRNAseq')
#######################################################################################################
## Install packages of dependency
###---> Install packages from Cran
cran.package.list <- c('shiny','shinydashboard','rhandsontable','shinyFiles','shinyjs','DT',
'plotly','ggplot2','eulerr','ggrepel',
'gridExtra','fastcluster','Gmisc','rmarkdown')
for(i in cran.package.list){
if(!(i %in% rownames(installed.packages()))){
message('Installing package: ',i)
install.packages(i,dependencies = T)
} else next
}
###---> Install packages from Bioconductor
bioconductor.package.list <- c('tximport','edgeR','limma','RUVSeq','ComplexHeatmap','rhdf5')
for(i in bioconductor.package.list){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if(!(i %in% rownames(installed.packages()))){
message('Installing package: ',i)
BiocManager::install("tximport", version = "3.8")
} else next
}
Please type the command in your RStudio and the App will be launched in default web browser (Google Chrome is recommended for image visualisation issues).
library(ThreeDRNAseq)
ThreeDRNAseq.app()
Example data can be downloaded from: xxx
The example RNA-seq data is a subset from study of Arabidopsis in response to cold (Calixto et al., 2018). Datasets of three time-points (red circles) were extracted from the whole study:
| Time-points | Description |
|---|---|
| T2 | Second time-point of Day 0, temperature \(20^oC\) |
| T10 | Second time-point of Day 1, temperature \(4^oC\) of cold transition |
| T19 | Second time-point of Day 4, temperature \(4^oC\) of cold acclimation |
Each time-point has 3 biological replicates and each biological replicate has 3 sequencing replicates. Transcript quantification was generated by using Salmon (Patro et al., 2017) and AtRTD2-QUASI (Zhang et al., 2017) as transcriptome reference. Expression comparisons were designed as “T10 vs T2” and “T19-T2” to identify the cold response genes and transcripts at both transcription and AS level.
Data download links:
The following datasets are required in the pipeline (Figure 1):
Note: the analysis will read information according to the column names in the spreadsheets. Please use the same column names as in examples.
Figure 1: User provided datasets.
Four folders will be created in the working directory to save the analysis results:
The detailed descriptions of files saved in the above folders can be found in Section “Saved files in local directory”.
To begin with, users can choose a working directory to save the output data, results and figures.
Figure 2: Choose working directory. Please click the “triangles” in the left-side “Directories” box to access and select sub-folders. The click and select in the right-side “Content” box is invalid.
The whole pipeline has a number of steps and each step is related to several intermediate datasets (see Section “Output data”). This table summarise the generation and usage steps of these datasets and whether (True or False) they have been loaded in workspace ready for use. To avoid running the pipeline from very beginning every time the App get refreshed, the following options are provided:
Figure 3: Data information for the analysis. The columns:
The csv files “mapping.csv” and “samples.csv” (see Figure 1A and B) are loaded (Figure 4A) and visualised (Figure 4C) in this step. The tximport R package (Soneson et al., 2016) is used to generate gene and transcript expression of read counts and TPMs (transcript per million reads) based on three options: “no” (default), “scaledTPM” and “lengthScaledTPM”. The “lengthScaledTPM” is recommended since it scales the expression using the transcript length and then the library size over samples (see the tximport documentation for details). The generated gene and transcript expression will be save as R data object txi_genes.RData and txi_trans.RData in the “data” folder, respectively. Spreadsheets of read counts and TPMs will be saved as csv files in the “result” folder.
Figure 4: Generate gene and transcript expression.
Once the read counts and TPMs are generated, the data is pre-processed with steps: Merge sequencing replicates (if exist), Remove low expressed transcripts/genes, Estimate batch effects (if exist) and Data normalisation. In each step, quality control plots are generated to optimise the parameters for pre-processing (Figure 5).
Figure 5: Data pre-processing. The figures were generated based on the RNA-seq data from study of Arabidopsis in response to cold (Calixto et al., 2018).
The sequencing replicates are generated by sequencing the same biological replicate multiple times, which do not add too much information to the technical variation. Sequencing replicates from the same biological replicate are often added to increase the depth. In the panel of Figure 6A, the sequencing replicate information will be extracted from the “srep” column of the “samples.csv” file ( Figure 1B) to indicate whether to merge the sequencing replicates.
The low expressed transcripts are filtered based on count per million (CPM) reads: \[ CPM=\frac{X_i}{N}\times 10^6 \] where \(X_i\) is read count of transcript \(i\) and \(N\) is the library size of the sample.
The sample number cut-off \(n\) and CPM cut-off \(m\) are determined by the mean-variance trend plot (Figure 6B and Figure 6C). Read counts are assumed to be negative binomial distributed of which the mean and variance has a relation: \[ Var(\log_2X)=\frac{1}{\mu}+\phi \] where \(X\) is read count, \(\mu\) is the mean and \(\phi\) is the overdispersion. The expression variance decreasing monotonically with the increasing of the mean. In real RNA-seq data, the low expressed transcripts confound with noise and the expression distribution is different to the expressed transcripts. In the mean-variance trend plot, the low expressed transcripts cause a drop trend in low expression region (Figure 5 and Figure 6C). Therefore, the cut-offs \(n\) and \(m\) to filter the low expression can be optimised until the drop trend in low expression region disappeared in the mean-variance trend plot.
Figure 6: Merge sequencing replicates and filter low expressed genes/transcripts. (D) is the scatter plot of mean-variance trends before and after filtering the low expressed transcripts. The fitted lines of the trends are also presented (Yellow line: before filtering; Red line: after filtering).
PCA is a mathematical method to reduce expression data dimensionality while remaining most of the data variance. The reduction is performed by projecting the data to directions or principal components (PCs) from highest to lowest data variability. Therefore, the data main variance is accessible by investigate top few number of PCs rather than thousands of variables. In Omic data analysis, the first two to four principal components are often used to visualise the similarities and differences of samples, thereby determining the grouping of samples. In the PCA scatter plot, the samples from the same condition often stay close, but far from the samples of other conditions. It can be used as evidence for data quality checking.
PCA plot also can be used to identify whether the RNA-seq data contain batch effects, which are caused by biological replications in different laboratory conditions. Compared with random noise, batch effects can be distinguished owning to the systematic biases of signals across replicates. For example, Figure 7A shows the PCA plot (PC1 vs PC2) of a RNA-seq data with 3 time-points T2, T10 and T19. Each time-point has 3 biological replicates. It can be seen that the first biological replicate (brep1) stays in a separate cluster to other two replicates, which indicates a clear batch effect of the data. In this pipeline, the RUVSeq R package (Risso et al., 2014) is used to estimate the batch effects. The RUVSeq algorithm generates batch effect terms which can be incorporated in the design matrix of linear regression model for 3D analysis, i.e.
\[ \text{observed expression = baseline effects + batch effects + noise} \]
It also generates a pseudo read count matrix in which the batch effects have been removed from the data. To avoid data over-modification, this matrix is only used to make PCA plot, but not used for downstream 3D analysis (Figure 7B).
In this panel (Figure 7), users can select and visualise different PCs based on transcript level or gene level expression, or the average expression of biological replicates. The scatter points can be grouped and coloured according to biological replicates or conditions. Ellipses or polygons can be added to the plots to highlight the grouped clusters. Three options, RUVr, RUVs and RUVg, in the RUVSeq R package are provided to estimate the batch effects (see the package documentation for details). Users can “Update PCA plot” to visualise whether the batch effects are removed properly. If there are no clear batch effects in the PCA plot, please skip this step.
Figure 7: PCA plot before (A) and after (B) remove batch effects.
To enable unbiased comparisons across samples, read counts must be normalised on the basis of sequencing depths. Normalisation methods Trimmed Mean of M-values (TMM), Relative Log Expression (RLE) and upper-quartile can be used to reduce the effects from the systematic technical biases across samples (Bullard et al., 2010). Box plots are used to visualise the expression distribution of raw read counts and normalised expression across samples.
Figure 8: Data normalisation.
A gene/transcript is identified as DE in a contrast group if \(L_2FC\) of expression \(\geq\) cut-off and with adjusted p-value < cut-off.
Two pipelines are provided for DE analysis: “limma-voom” and “edgeR”. The “edgeR” pipeline includes two methods: “glmQL” (Genewise Negative Binomial Generalized Linear Models with Quasi-likelihood Tests) and “glm” (Genewise Negative Binomial Generalized Linear Models). In limma pipeline, \(L_2FCs\) are calculated based count per million reads (CPMs) while in edgeR, \(L_2FCs\) are based on read counts. From several real RNA-seq data analyses, high consensus is achieved between “limma-voom” and “glmQL” (>90%) and they have more stringent controls of false discovery rate than the “glm” method.
To test differential alternative splicing (AS), the gene level expression is compared to transcript level expression in the contrast groups. A gene is DAS in a contrast group if adjusted p-value < cut-off and at least one transcript of the gene with \(\Delta PS \geq\) cut-off. A transcript is DTU if adjusted p-value < cut-off and \(\Delta PS \geq\) cut-off.
Two pipelines for AS analysis:
To perform the analysis, the contrast groups in csv file (Figure 1D) can be loaded in panel Figure 9A. 3D analysis pipeline, \(\Delta PS\) and cut-offs of test statistics can be selected/generated in panels Figure 9B and C. Subset of \(\Delta PS\) can be visualised in panel Figure 9D.
Figure 9: 3D RNA-seq analysis panels.
This page includes panels to show the 3D analysis results (Figure 10-12):
Figure 10: Number of DE genes, DAS genes, DE transcripts and DTU transcripts, and bar plots of up-down regulation numbers.
Figure 11: Euler diagram of numbers between contrast groups (A) and numbers to compare DE vs DAS genes and DT vs DTU transcripts (B).
Figure 12: Flow charts to show numbers of DE and/or DAS genes and DE and/or DTU transcripts.
Users can make heatmap for DE genes, DAS genes, DE transcripts and DTU transcripts identified from the analysis (Figure 13A). Users can also upload a gene or transcript list in csv file to make heatmap for specific list (Figure 13B). To make the heatmap, the average TPMs of conditions for the targets are standardized into z-scores. In corresponding plots, the DAS only genes or DTU only transcripts are excluded since they only have AS changes rather than significant abundance changes. The dist and hclust R functions are used to perform the hierarchical clustering analysis. The heatmaps and the target list in each cluster of the heatmaps can be saved to local folder.
Figure 13: Heatmap panel (A) and example of user provided gene list in csv file to make heatmap (B). To avoid confliction, if the radio button “Select target” is selected, the “Or Choose target list csv file” function is enabled; If the radio button “Upload target list” is selected, the “Select targets” function is enabled.
Gene and transcript expression profiles (TPMs or read counts) and PS (percent spliced) can be visualised by typing a gene name (Figure 14A). Users can also generate a number of plots by provides a gene list (Figure 13B) and all the plots will be saved to “Multiple profile plot” folder in the “figure” folder of working directory (Figure 14B).
Figure 14: Plot of expression profiles and PS across conditions
Users can generate DE and DAS gene list by click “Generate” button (Figure 15B). These gene lists can be loaded to Gene Ontology (GO) analysis tools/databases (e.g. DAVID and agriGO) to generate GO annotation. A csv file with GO annotation information is required to generate the annotation plot (Figure 15B). The file includes a “Category” column of CC (cellular component), BP (biological process) and MF (molecular function), a column of “Term” of GO annotation and columns of statistics to report the annotation enrichment, e.g. count, FDR, -log10(FDR), etc. (Figure 15A). In the panel, users can select different statistics to visualise. The selected gene list type will present in the plot title to distinguish whether the provided gene list is DE or DAS genes.
Figure 15: GO annotation plot panel.
The table in this page lists the parameters for 3D analysis and the locations of saved files, which will be used to generate the report. If the information is wrong, users can double click the cells in the table to correct to values. Based on this information, reports in three formats, word, pdf and html, will be generated and saved in the “report” folder of working directory.
Figure 16: Generate report panel.
| File names | Description |
|---|---|
| contrast.RData | Contrast group |
| DAS_genes.RData | Testing statistics of DAS genes. |
| DE_genes.RData | Testing statistics of DE genes. |
| DE_trans.RData | Testing statistics of DE transcripts. |
| deltaPS.RData | Matrix of deltaPS values. |
| DTU_trans.RData | Testing statistics of DTU transcripts. |
| genes_3D_stat.RData | The returned object of DE gene analysis by using limma.pipeline/edgeR.pipeline functions in ThreeDRNAseq R package. |
| genes_batch.RData | The returned object of gene level batch effect estimation by using RUVSeq R package. |
| genes_dge.RData | The returned object of gene expression normalization by using calcNormFactors function in edgeR. |
| mapping.RData | Matrix of transcript-gene mapping. |
| PS.RData | Matrix of transcript percent spliced (PS), which is used to generate deltaPS based on contrast groups. |
| samples.RData | The sample information before merging sequencing replicates. |
| samples_new.RData | The sample information after merging sequencing replicates. If there are no sequencing replicates, the “sample_new” object has the same number of rows with the “samples” object. |
| target_high.RData | A list object with two element: expressed transcripts (trans_high) and expressed genes (genes_high). |
| trans_3D_stat.RData | The returned object of DAS gene, DE and DTU transcripts analysis by using limma.pipeline/edgeR.pipeline functions in ThreeDRNAseq R package. |
| trans_batch.RData | The returned object of transcript level batch effect estimation by using RUVSeq R package.. |
| trans_dge.RData | The returned object of transcript expression normalization by using calcNormFactors function in edgeR. |
| txi_genes.RData | The returned object of gene expression by using tximport R package. |
| txi_trans.RData | The returned object of transcript expression by using tximport R package. |
| File.names | Description |
|---|---|
| counts_genes.csv | Gene level raw read counts |
| counts_trans.csv | Transcript level raw read counts |
| DAS genes.csv | Testing statistics of DAS genes. |
| data.info.csv | Data information during data pre-processing. |
| DE DAS DTU numbers.csv | Numbers of 3D genes/transcripts in contrast groups. |
| DE genes.csv | Testing statistics of DE genes. |
| DE transcripts.csv | Testing statistics of DE transcripts. |
| DE vs DAS gene number.csv | Numbers of DE vs DAS genes in contrast groups. |
| DE vs DTU transcript number.csv | Numbers of DE vs DTU transcripts in contrast groups. |
| DTU transcripts.csv | Testing statistics of DTU transcripts. |
| Parameter summary.csv | Methods/Parameters/Cut-offs used for 3D RNA-seq analysis. |
| Target in each cluster heatmap * DE&DAS genes.csv | DE&DAS gene lists in clusters of DAS gene heatmap. The DASonly genes are excluded since they have no significant abundance changes across samples. |
| Target in each cluster heatmap * DE genes.csv | DE gene list in clusters of DE gene heatmap. |
| Target in each cluster heatmap * DE trans.csv | DE&DTU transcript lists in clusters of DTU transcript heatmap. The DTUonly transcripts are excluded since they have no significant abundance changes across samples. |
| Target in each cluster heatmap * DE&DTU trans.csv | DE transcript list in clusters of DE transcript heatmap. |
| TPM_genes.csv | Gene level raw TPMs |
| TPM_trans.csv | Transcript level raw TPMs |
| File.names | Description |
|---|---|
| DAS genes GO annotation plot.png/.pdf | DAS genes GO annotation plot |
| DAS genes updown regulation numbers.png/.pdf | DAS genes updown regulation numbers |
| DE genes GO annotation plot.png/.pdf | DE genes GO annotation plot |
| DE genes updown regulation numbers.png/.pdf | DE genes updown regulation numbers |
| DE transcripts updown regulation numbers.png/.pdf | DE transcripts updown regulation numbers |
| DTU transcripts updown regulation numbers.png/.pdf | DTU transcripts updown regulation numbers |
| Gene data distribution.png/.pdf | Gene data distribution |
| Gene mean-variance trend.png/.pdf | Gene mean-variance trend |
| Gene PCA Average expression.png/.pdf | Gene PCA Average expression |
| Gene PCA batch effect removed Bio-reps.png/.pdf | Gene PCA batch effect removed Bio-reps |
| Gene PCA Bio-reps.png/.pdf | Gene PCA Bio-reps |
| Heatmap DAS genes.png/.pdf | Heatmap DAS genes |
| Heatmap DE genes.png/.pdf | Heatmap DE genes |
| Heatmap DE transcripts.png/.pdf | Heatmap DE transcripts |
| Heatmap DTU transcripts.png/.pdf | Heatmap DTU transcripts |
| Transcript data distribution.png/.pdf | Transcript data distribution |
| Transcript mean-variance trend.png/.pdf | Transcript mean-variance trend |
| Transcript PCA Average expression.png/.pdf | Transcript PCA Average expression |
| Transcript PCA batch effect removed Bio-reps.png/.pdf | Transcript PCA batch effect removed Bio-reps |
| Transcript PCA Bio-reps.png/.pdf | Transcript PCA Bio-reps |
| Union set DE genes vs DAS genes.png/.pdf | Flow chart -Union set DE genes vs DAS genes |
| Union set DE transcripts vs DTU transcripts.png/.pdf | Flow chart -Union set DE transcripts vs DTU transcripts |
| File.names | Description |
|---|---|
| report.docx | word format |
| report.html | html format |
| report.pdf | pdf format |
Bray,N.L., Pimentel,H., Melsted,P., and Pachter,L. (2016) Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol., 34, 525–527.
Bullard,J.H., Purdom,E., Hansen,K.D., and Dudoit,S. (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94.
Calixto,C.P.G., Guo,W., James,A.B., Tzioutziou,N.A., Entizne,J.C., Panter,P.E., Knight,H., Nimmo,H.G., Zhang,R., and Brown,J.W.S. (2018) Rapid and dynamic alternative splicing impacts the Arabidopsis cold response transcriptome. bioRxiv, 251876.
Law,C.W., Chen,Y., Shi,W., and Smyth,G.K. (2014) voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 15, R29.
Patro,R., Duggal,G., Love,M.I., Irizarry,R.A., and Kingsford,C. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods, 14, 417–419.
Risso,D., Ngai,J., Speed,T.P., and Dudoit,S. (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol., 32, 896–902.
Ritchie,M.E., Phipson,B., Wu,D., Hu,Y., Law,C.W., Shi,W., and Smyth,G.K. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res, 43, e47.
Robinson,M., Mccarthy,D., Chen,Y., and Smyth,G.K. (2011) edgeR : differential expression analysis of digital gene expression data User ’ s Guide. Most, 23, 1–77.
Soneson,C., Love,M.I., and Robinson,M.D. (2016) Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521.
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.1 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2
## [5] tools_3.5.1 htmltools_0.3.6 yaml_2.2.0 Rcpp_0.12.18
## [9] stringi_1.1.7 rmarkdown_1.10 knitr_1.20 stringr_1.3.1
## [13] digest_0.6.17 evaluate_0.11